Super Tic Tac Toe Reinforcement Learning

August 8, 2024

Code

All of the code for this post can be found in this repository.

A Note

First of all, I'm sorry for not posting much this summer. I've been pretty wrapped up in some cool stuff.

Introduction

Super Tic Tac Toe is a variant of the classic game tic tac toe. Presh Talwalkar explains it well here. My goal with this short post is to design (but not run) a reinforcement learning procedure so that a Deep Q-Network (DQN) learns the optimal strategy for Super Tic Tac Toe. The structure of this post is as follows:

  1. Review of Reinforcement Learning
  2. Representing Super Tic Tac Toe Mathematically
  3. Learning Procedure

Note: I will not be training the model. I'm just exercising my understanding of reinforcement learning.

1. Review of Reinforcement Learning

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. The key components of an RL system are:

  - Agent: the learner and decision maker (here, the player).
  - Environment: everything the agent interacts with (here, the Super Tic Tac Toe board).
  - State \( s \): a representation of the environment at a given time.
  - Action \( a \): a choice available to the agent in state \( s \).
  - Reward \( r \): a scalar signal the environment returns after each action.
  - Policy \( \pi \): a mapping from states to actions that the agent tries to learn.

The objective is to learn a policy \(\pi\) that maximizes the expected cumulative reward over time, often defined as the return \( R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \), where \( \gamma \) is the discount factor.
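To make the return concrete, here is a minimal sketch of computing a finite-horizon discounted return in Python; the reward sequence and discount value in the example are made up purely for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} over a finite episode."""
    R = 0.0
    # Iterate backwards so each step folds in the discounted future return.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: a short episode with a win (+1) on the final move.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```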

2. Representing Super Tic Tac Toe Mathematically

Super Tic Tac Toe is a more complex game comprising a \(3 \times 3\) grid where each cell itself is a \(3 \times 3\) tic tac toe board. Let's denote:

  - \( L_{ij} \): the small \(3 \times 3\) board in row \( i \), column \( j \) of the outer grid, with \( i, j \in \{0, 1, 2\} \).
  - Each cell of \( L_{ij} \): \(+1\) if marked by X, \(-1\) if marked by O, and \(0\) if empty.

The state \( s \) can be represented as a flattened array of size 81:

\[ s = \text{flatten}(\{L_{ij}\}_{i,j=0}^2) \]

The action \( a \) is choosing a cell in the \(3 \times 3\) grid of the designated small board. For simplicity, this can be represented as a single integer in the range \([0, 80]\), where \( a = 9b + c \), with \( b \) the index of the designated small board and \( c \) the cell index within it.
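Here is a minimal sketch of this representation in Python, assuming the \(\{+1, -1, 0\}\) cell encoding above; the names `SuperTTTState` and `decode_action` are my own and purely illustrative.

```python
import numpy as np

class SuperTTTState:
    """State of Super Tic Tac Toe: a 3x3 grid of 3x3 small boards."""

    def __init__(self):
        # boards[i, j] is the small board L_ij in row i, column j of the outer grid;
        # each cell holds +1 (X), -1 (O), or 0 (empty).
        self.boards = np.zeros((3, 3, 3, 3), dtype=np.int8)

    def to_vector(self):
        """Flatten {L_ij} into the length-81 state vector s."""
        return self.boards.reshape(81).astype(np.float32)

def decode_action(a):
    """Map a global action index a in [0, 80] to (small board, cell)."""
    board_index, cell_index = divmod(a, 9)
    board_rc = divmod(board_index, 3)  # (row, col) of the small board
    cell_rc = divmod(cell_index, 3)    # (row, col) inside that board
    return board_rc, cell_rc

# Example: action 40 is the centre cell of the centre board.
print(decode_action(40))  # ((1, 1), (1, 1))
```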

3. Learning Procedure

3.1 Q-Learning

Q-learning is a value-based RL method that learns the optimal action-value function \( Q(s, a) \), and hence the optimal action-selection policy, via the update rule:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

where:

  - \( \alpha \) is the learning rate,
  - \( r \) is the reward received after taking action \( a \) in state \( s \),
  - \( \gamma \) is the discount factor,
  - \( s' \) is the resulting next state, and
  - \( a' \) ranges over the actions available in \( s' \).
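To make the update concrete, here is a minimal tabular sketch in Python; the dictionary-based Q-table and the hyperparameter values are illustrative assumptions, separate from the DQN described next.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

def q_update(state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD target - Q(s,a))."""
    # For a terminal next state there are no available actions, so the
    # bootstrap term defaults to zero.
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```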

3.2 Deep Q-Network

We use a neural network to approximate the Q-value function \( Q(s, a; \theta) \), where \( \theta \) are the weights of the network. The loss function for training the network is:

\[ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] \]

where \( \theta^- \) are the weights of the target network, which are periodically updated to stabilize training.
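Here is a sketch of how this loss could be computed with PyTorch under the representation above (81-dimensional states, 81 actions); the network architecture and layer sizes are arbitrary illustrative choices, not a prescription.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Small MLP approximating Q(s, a; theta) for all 81 actions at once."""
    def __init__(self, state_dim=81, n_actions=81, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)  # shape: (batch, n_actions)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error over a minibatch (s, a, r, s', done) drawn from D."""
    s, a, r, s_next, done = batch  # a: int64 action indices, done: 0/1 floats
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():
        max_next = target_net(s_next).max(dim=1).values     # max_a' Q(s', a'; theta^-)
        target = r + gamma * (1 - done) * max_next
    return nn.functional.mse_loss(q_sa, target)
```

The `(1 - done)` factor simply drops the bootstrap term for terminal transitions, which the expectation over \( D \) in the loss above implicitly assumes.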

3.3 Training Procedure

  1. Initialization: Initialize replay memory \( D \) and Q-network \( Q(s, a; \theta) \).
  2. Episode Loop: